Also called “information gain", is a measure of the difference between two probability distributions P and Q. It is not symmetric and does not obey the triangle inequality, thus is not a true metric.
KL divergence from Q to P:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx = \mathbb{E}_{p}\!\left[\log\frac{p(x)}{q(x)}\right]$$
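A minimal sketch of the discrete counterpart, $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$, assuming NumPy; the distributions `p` and `q` below are made-up examples, and the second print shows the asymmetry.

```python
# Minimal sketch: KL divergence between two discrete distributions (in nats).
# p and q are made-up example distributions.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # convention: 0 * log(0 / q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.4, 0.2]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q))            # > 0
print(kl_divergence(q, p))            # a different value: KL is not symmetric
```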
In information theory, $D_{\mathrm{KL}}(P \,\|\, Q)$:
- it is the amount of information lost when Q is used to approximate P
- it measures the expected number of extra bits required to code samples from P using a code optimized for Q (see the sketch below)
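A sketch of the extra-bits reading, assuming NumPy: in base 2, $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$, i.e. cross-entropy minus entropy. The two distributions are made-up examples.

```python
# Sketch: D_KL(P || Q) in bits equals cross-entropy H(P, Q) minus entropy H(P),
# i.e. the expected extra bits per symbol when coding P with a code built for Q.
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # "true" source distribution P
q = np.array([0.25, 0.25, 0.5])   # distribution Q the code was optimized for

entropy = -np.sum(p * np.log2(p))         # optimal bits per symbol for P
cross_entropy = -np.sum(p * np.log2(q))   # bits per symbol using Q's code
kl_bits = np.sum(p * np.log2(p / q))      # D_KL(P || Q) in bits

print(cross_entropy - entropy)   # 0.25 extra bits per symbol
print(kl_bits)                   # 0.25, the same number
```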
Proof that $D_{\mathrm{KL}}(Q \,\|\, P) \ge 0$ (the same argument works for $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$):
$$0 = \log 1 = \log\int p(x)\,dx = \log\int \frac{p(x)}{q(x)}\,q(x)\,dx \ge \int q(x)\,\log\frac{p(x)}{q(x)}\,dx = \mathbb{E}_{q}\!\left[\log\frac{p(x)}{q(x)}\right] = -D_{\mathrm{KL}}(Q \,\|\, P)$$
The inequality step is due to Jensen's inequality, $f(\mathbb{E}[x]) \ge \mathbb{E}[f(x)]$ for concave $f$ (here $f = \log$, applied under $\mathbb{E}_q$).
Note that $D_{\mathrm{KL}}(Q \,\|\, P) = 0$ iff $q(x) = p(x)$.
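A quick numerical check of the above, assuming NumPy and randomly generated (made-up) distributions: Jensen's inequality for the concave $\log$, non-negativity of $D_{\mathrm{KL}}(Q \,\|\, P)$, and equality at $q = p$.

```python
# Numerical check: Jensen's inequality for log, D_KL(Q || P) >= 0,
# and D_KL = 0 when the two distributions coincide.
import numpy as np

rng = np.random.default_rng(0)

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Two random, strictly positive distributions on 10 points.
p = rng.random(10) + 0.1
p /= p.sum()
q = rng.random(10) + 0.1
q /= q.sum()

# Jensen: log(E_q[p/q]) >= E_q[log(p/q)]; the left side is log(1) = 0.
lhs = np.log(np.sum(q * (p / q)))   # = log 1 = 0
rhs = np.sum(q * np.log(p / q))     # = -D_KL(Q || P)
print(lhs >= rhs)                   # True

print(kl(q, p) >= 0)                # True: D_KL(Q || P) >= 0
print(np.isclose(kl(p, p), 0.0))    # True: zero when q(x) = p(x)
```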
If P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, and Q represents a theory, model, description, or approximation of P:
- Minimizing $D_{\mathrm{KL}}(Q \,\|\, P)$ (reverse KL): zero-forcing, tends to underestimate the support of P (the better choice in practice: optimization reaches at least a local optimum); see the sketch after this list
- Minimizing $D_{\mathrm{KL}}(P \,\|\, Q)$ (forward KL): zero-avoiding, tends to overestimate the support of P (expectations computed under the resulting Q can be poor)
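The sketch below illustrates the two behaviours on a 1-D grid, assuming NumPy: a single Gaussian Q is fitted to a made-up bimodal mixture P by brute-force search over its mean and standard deviation. Minimizing $D_{\mathrm{KL}}(Q \,\|\, P)$ locks onto one mode (zero-forcing), while minimizing $D_{\mathrm{KL}}(P \,\|\, Q)$ spreads Q over both modes (zero-avoiding). All numbers (modes at $\pm 3$, grid ranges) are assumptions chosen only for illustration.

```python
# Illustration of zero-forcing (reverse KL) vs zero-avoiding (forward KL).
# P is a two-mode Gaussian mixture; Q is a single Gaussian chosen by brute-force
# grid search over (mu, sigma). All constants are made up for illustration.
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal "true" distribution P, normalized on the grid.
p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)
p /= p.sum() * dx

def kl(a, b):
    """Riemann-sum approximation of D_KL(A || B) on the grid."""
    return np.sum(a * np.log(a / b)) * dx

best_rev, best_fwd = None, None
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = gauss(x, mu, sigma)
        rev = kl(q, p)   # D_KL(Q || P): reverse KL, zero-forcing
        fwd = kl(p, q)   # D_KL(P || Q): forward KL, zero-avoiding
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)

# Reverse KL locks onto one mode (mu near +/-3, sigma near 1);
# forward KL covers both modes (mu near 0, larger sigma).
print("argmin D_KL(Q||P): mu=%.2f sigma=%.2f" % best_rev[1:])
print("argmin D_KL(P||Q): mu=%.2f sigma=%.2f" % best_fwd[1:])
```

Grid search is used only to keep the sketch dependency-free; the reverse-KL objective is what variational methods optimize in practice, typically with gradients.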
References
- Introduction to variational Bayesian methods: https://www.youtube.com/watch?v=HOkkr4jXQVg
- KL Divergence: https://en.wikipedia.org/wiki/Kullback–Leibler_divergence